I am using BERT for multiclass text classification. There are 116 output classes to predict from, and I see a high degree of class imbalance in the data.
Here are the record counts available for some of the classes:
{"Class A": 975 records,
"Class B": 776 records,
"Class C": 533 records,
"Class D": 412 records,
"Class E": 302 records,
"Class F": 250 records,
"Class G": 207 records,
"Class H": 137 records,
"Class I": 96 records,
"Class J": 51 records,
"Class K": 28 records,
"Class L": 17 records,
"Class M": 7 records,
"Class N": 2 records}
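To give a sense of the scale of the imbalance, here is a rough inverse-frequency weighting of these counts (a sketch using the "balanced" heuristic, total / (n_classes * count), restricted to just the 14 classes listed above):

```python
# Inverse-frequency class weights from the counts above
# ("balanced" heuristic: total / (n_classes * count)).
counts = {
    "Class A": 975, "Class B": 776, "Class C": 533, "Class D": 412,
    "Class E": 302, "Class F": 250, "Class G": 207, "Class H": 137,
    "Class I": 96, "Class J": 51, "Class K": 28, "Class L": 17,
    "Class M": 7, "Class N": 2,
}
total = sum(counts.values())  # 3793
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
print(round(weights["Class A"], 2))  # ~0.28   (majority class, down-weighted)
print(round(weights["Class N"], 2))  # ~135.46 (rarest class, heavily up-weighted)
```

So the rarest class here would need to be weighted roughly 500x more than the most frequent one.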
So I have two questions here:
Question 1: With 116 output classes to predict from, does the high number of classes hurt BERT's performance?
Question 2: My full dataset follows the same kind of class distribution illustrated above. How does this imbalance affect BERT's performance, and if it does, how do we handle it to get proper output?
Looking forward to answers from the talented community we have here.
@swagat1509 Were you able to solve this? I have the same scenario: around 106 classes and a highly imbalanced dataset, with 23k records for some classes and only 2 records for others. I tried different models (distilbert-base-uncased, bert-base, deberta, roberta, bigbird) with different hyperparameter combinations and different loss functions (focal loss, weighted loss, etc.), but I am not able to break the 84% accuracy mark. Please reply if possible. Also, if someone else can help with this scenario, your help would be greatly appreciated.
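For reference, this is roughly what I mean by "weighted loss" / "focal loss": a minimal sketch that overrides compute_loss in a Hugging Face Trainer. The class name WeightedLossTrainer and the class_weights / gamma parameters are placeholders for your own setup, not a library API:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer with weighted cross-entropy; set gamma > 0 for focal loss."""

    def __init__(self, *args, class_weights=None, gamma=0.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights  # 1-D float tensor, one weight per class
        self.gamma = gamma                  # focal-loss focusing parameter

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        weights = (self.class_weights.to(logits.device)
                   if self.class_weights is not None else None)
        # Per-example cross-entropy, weighted per class if weights are given.
        ce = F.cross_entropy(logits, labels, weight=weights, reduction="none")
        if self.gamma > 0:
            # Focal loss: down-weight easy examples by (1 - p_t)^gamma.
            pt = torch.exp(-F.cross_entropy(logits, labels, reduction="none"))
            ce = (1 - pt) ** self.gamma * ce
        loss = ce.mean()
        return (loss, outputs) if return_outputs else loss
```

The class_weights tensor can come from inverse-frequency counts like the ones earlier in this thread, e.g. torch.tensor(list(weights.values()), dtype=torch.float).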